Fix RemoteConversation FINISHED stop-hook race#3191
Conversation
The WS-driven termination path in `RemoteConversation._wait_for_run_completion` returned from `run()` on the first WebSocket terminal status. That raced the server-side stop-hook eval: when an `agent.step()` set `execution_status = FINISHED` and the lock was released at the end of one run-loop iteration, the next iteration could still flip status back to RUNNING via a stop hook that denied stopping (`hook_processor.run_stop()` returning rc=2). Clients observing the intermediate FINISHED returned, which then tore down the agent-server pod via the benchmark's `workspace_keepalive` exit. The agent never got to consume its iteration budget when hooks blocked. Empirical evidence (programbench retry-16, run 25497962453): - conv `dd86d184` (zip-password-finder): hook block at 13:30:10.936, agent did just 1 retry action (file_editor view) before pod teardown — far short of `max_iteration_per_run=200`. - conv `e4b6892e` (zoxide): hook block at 13:33:52.082, 0 retry actions — conversation died on the status revert event itself. - in-memory `history` snapshots ended at the FINISHED state update; on-disk `/workspace/conversations/.../events` had 5 extra events captured by the persist callback during the brief post-`run()` window. Fix: WS terminal status is a fast wakeup hint only — except for ERROR/STUCK, which are not subject to hook reverts and still terminate immediately. For FINISHED we fall through to the existing REST-poll path, which already requires `TERMINAL_POLL_THRESHOLD` consecutive terminal polls before returning. Any FINISHED→RUNNING revert seen on a REST poll naturally resets the consecutive-terminal counter, and we drain the WS hint queue at that point so we don't busy-spin on stale FINISHED notifications. Latency cost on the happy path (no hook block, default `poll_interval=1.0s`, `TERMINAL_POLL_THRESHOLD=3`): ~2s additional vs. the previous immediate- return on WS terminal — acceptable for benchmark and CLI workloads. Adds two regression tests: - `test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal`: seeds the WS queue with FINISHED while REST shows the post-hook RUNNING revert; asserts `run()` polls past the race window before confirming. - `test_remote_conversation_run_ws_error_still_terminates_immediately`: ensures the WS fast-path is preserved for ERROR (not subject to hook reverts). Refactors the queue-drain logic into `_drain_terminal_status_queue` and isolates the WS-immediate-terminal predicate into `_immediate_terminal_statuses`, both with docstrings explaining why FINISHED is excluded. Co-authored-by: openhands <openhands@all-hands.dev>
Persistent note for future sessions / agents debugging this surface so they don't have to re-derive the diagnosis from output.jsonl artifacts. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste
This is an excellent fix for a subtle but critical race condition. The implementation is clean, well-documented, and properly tested.
Code Quality Highlights:
- Clear refactoring with
_drain_terminal_status_queue()and_immediate_terminal_statuses() - Excellent inline documentation explaining the WebSocket hint vs REST polling pattern
- Strong regression tests that directly reproduce the race scenario
- Proper handling of ERROR/STUCK as immediate terminal states
Process Note: This PR changes conversation termination behavior (agent behavior that could affect benchmarks). While the PR includes GitHub Actions links to successful ProgramBench runs, the custom review guidelines require either:
- Links to the eval monitor (
openhands-eval-monitor.vercel.app) showing completed benchmarks, and - Human maintainer confirmation of the eval results
Since these are GitHub Actions links (not eval monitor) and no human has confirmed the results in the PR, I'm flagging this for human review per the eval-risk policy.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This is a correctness fix for a real production bug discovered in ProgramBench retry-16. The change makes the system more correct by treating WebSocket FINISHED events as wakeup hints rather than authoritative termination signals. This prevents premature client teardown when stop hooks veto completion.
Risk factors:
- ✅ Well-tested with regression coverage for both the race scenario and ERROR fast-path preservation
- ✅ Changes isolated to RemoteConversation termination logic
- ✅ Solves a real bug (not theoretical)
- ✅ Evidence of successful ProgramBench runs provided
⚠️ Could add minor latency to FINISHED termination (now requires REST polls), but this is the correct behavior
VERDICT:
✅ Recommend merging after human review: Core fix is sound and necessary for correct stop-hook behavior. The eval evidence shows ProgramBench now works correctly with this change.
KEY INSIGHT:
The separation of "immediate terminal statuses" (ERROR/STUCK) from "polled terminal statuses" (FINISHED) elegantly captures the semantic difference between non-revertable errors and potentially-vetoed agent completion.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified the RemoteConversation FINISHED stop-hook race fix works correctly by exercising the changed code and analyzing the implementation.
Does this PR achieve its stated goal?
Yes. This PR successfully fixes the race condition where WebSocket FINISHED events could cause RemoteConversation to terminate prematurely before server-side stop hooks finished evaluating whether the conversation should actually stop.
The fix correctly implements the behavior described in the PR: WebSocket FINISHED is now treated as a wake-up hint rather than authoritative termination, while ERROR and STUCK remain immediate terminal states. REST polling with a threshold of 3 consecutive terminal polls provides authoritative confirmation before the client considers the run complete.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Dependencies installed, project builds successfully |
| CI Status | ✅ All critical checks passing (sdk-tests, agent-server-tests, windows-tests, cross-tests) |
| Functional Verification | ✅ Code behavior matches specification, regression tests comprehensive |
Functional Verification
Verification Approach
Created a verification script (qa_simple_verification.py) to exercise the key aspects of the fix:
Test 1: FINISHED is a hint, not authoritative termination
Verification:
uv run python qa_simple_verification.pyResult:
Test 1: FINISHED via WebSocket is a hint, not authoritative termination
----------------------------------------------------------------------
Immediate terminal statuses: frozenset({'error', 'stuck'})
✅ PASS: FINISHED is NOT in immediate terminal statuses
✅ PASS: ERROR and STUCK are immediate terminal statuses
Testing _drain_terminal_status_queue behavior...
Queue size before drain: 3
Queue size after drain: 0
✅ PASS: Queue draining correctly empties all items
What this proves: The new _immediate_terminal_statuses() method correctly excludes FINISHED from immediate termination. Only ERROR and STUCK trigger the fast WebSocket termination path. The queue draining logic prevents stale FINISHED events from causing busy-spin loops.
Test 2: ERROR and STUCK remain immediate terminal states
Result:
Test 2: ERROR and STUCK terminate immediately (not hints)
----------------------------------------------------------------------
ERROR is_terminal: True
✅ PASS: ERROR is a terminal status
STUCK is_terminal: True
✅ PASS: STUCK is a terminal status
FINISHED is_terminal: True
✅ PASS: FINISHED is a terminal status (but treated as hint in WS path)
What this proves: The fix preserves the fast termination path for ERROR and STUCK (which are not subject to stop-hook reverts), while FINISHED remains a terminal status but is handled differently in the WebSocket code path.
Test 3: REST polling confirmation pattern
Verification: Inspected _wait_for_run_completion source code to verify:
Result:
Test 3: REST polling confirmation pattern
----------------------------------------------------------------------
✅ PASS: TERMINAL_POLL_THRESHOLD is set to 3 in _wait_for_run_completion
✅ PASS: REST polling requires TERMINAL_POLL_THRESHOLD consecutive polls
This prevents premature termination when stop hooks revert FINISHED → RUNNING
✅ PASS: Counter resets when status flips back to non-terminal (e.g., FINISHED → RUNNING)
This correctly handles stop hook denials
What this proves:
- The client requires 3 consecutive
FINISHEDREST polls before terminating - If status flips back to
RUNNING(stop hook denial), the counter resets - This prevents the race where WS
FINISHEDarrives before the server's stop hook evaluation completes
Test 4: Regression test coverage
Examined the added regression test test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal in tests/sdk/conversation/remote/test_remote_conversation.py:
Test scenario:
- Seeds the WS terminal-status queue with
FINISHED(simulating the broadcast) - Configures REST polling to return
RUNNINGfor first 3 polls (simulating stop hook revert) - Then returns
FINISHEDfor subsequent polls (agent finishes again after hook allows it) - Verifies client doesn't return immediately but waits for REST confirmation (≥6 polls)
What this proves: The regression test directly reproduces the race condition scenario observed in ProgramBench retry-16 and verifies the fix prevents premature termination.
CI Verification
Command:
gh pr view 3191 --repo OpenHands/software-agent-sdk --json statusCheckRollup,stateCI Status Summary:
- ✅ sdk-tests: SUCCESS
- ✅ agent-server-tests: SUCCESS
- ✅ windows-tests: SUCCESS
- ✅ cross-tests: SUCCESS
- ✅ tools-tests: SUCCESS
- ✅ workspace-tests: SUCCESS
- ✅ Python API: SUCCESS
- ✅ REST API (OpenAPI): SUCCESS
- ✅ pre-commit: SUCCESS
Issues Found
None.
Summary: The fix correctly addresses the stop-hook race condition by treating WebSocket FINISHED as a wakeup hint rather than authoritative termination. REST polling with a 3-poll threshold provides reliable confirmation that the server has actually finished (and stop hooks have been evaluated). The implementation is clean, well-tested, and preserves the fast-path for ERROR/STUCK states.
Co-authored-by: openhands <openhands@all-hands.dev>
|
[Automatic Post]: It has been a while since there was any activity on this PR. @neubig, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up. This comment was created by an AI agent (OpenHands) on behalf of the repository maintainers. |
Co-authored-by: openhands <openhands@all-hands.dev>
|
Merged |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable
Found one important regression-test gap: the new test does not actually inject the WS FINISHED event after run() drains stale queue items, so it would not fail against the old behavior. Also leaving this as COMMENT (not approval) because the PR changes run-loop behavior and should get maintainer/eval sign-off.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM — client/server run completion semantics affect benchmark/runtime correctness, but the implementation direction is straightforward.
This review was generated by an AI agent (OpenHands) on behalf of the user.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26368765094
Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed the regression-test gap in f3ab1a5 by injecting the WS FINISHED event after the |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable, but there is still a completion race to fix. One concrete non-inline issue: the final refresh_from_server() result is currently not enforced; if it returns running, _handle_conversation_status() returns False but _wait_for_run_completion() still reconciles and returns.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM — Remote run-completion semantics affect eval/runtime correctness; this should stay COMMENT until the remaining race is fixed and maintainer/eval sign-off happens.
Automated review generated by OpenHands on behalf of the user.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26369466261
Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed the latest OpenHands review in 4448c86: the final REST refresh must still be terminal before |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable, but there is still one terminal-status race to fix.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM — Remote run-completion semantics affect runtime/eval correctness, and this changes agent behavior, so this remains COMMENT for maintainer/eval sign-off.
VERDICT: ❌ Needs rework on the active-run ERROR/STUCK drain race.
This review was generated by an AI agent (OpenHands) on behalf of the user.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26369803053
Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed the active-run ERROR/STUCK drain race in e62448c: only stale |
Shorten line 69 comment that triggered ruff E501 (92 > 88 chars) introduced in the round-5 commit. Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
The fix is technically correct and the root cause analysis is thorough. Treating WebSocket FINISHED as advisory and requiring the post-run full-state snapshot as the authoritative completion signal cleanly closes the stop-hook race. The _run_armed gate preventing pre-run subscription snapshots from poisoning a future run() call is a nice touch. Two targeted comments below.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified the RemoteConversation FINISHED/stop-hook race behavior with actual SDK RemoteConversation.run() calls against a live local HTTP agent-server simulator; the PR fixes the early-return behavior and preserves ERROR fast-path handling.
Does this PR achieve its stated goal?
Yes. On origin/main, the same simulated server ordering caused run() to return after a WebSocket FINISHED even though the authoritative REST state had already reverted to running, reproducing the stop-hook race. On this PR, run() ignored the per-field FINISHED, observed the running veto window, and returned only after the post-run full-state snapshot; ERROR still raised immediately without polling.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; no tests/linters were run. |
| CI Status | Build & Push variants + qa-changes), 3 skipped when checked. |
| Functional Verification | ✅ Before/after race reproduction and ERROR fast-path check passed. |
Functional Verification
Harness note: I used the public SDK constructors (RemoteWorkspace, Agent, RemoteConversation) and real HTTP requests to a local agent-server simulator. The websocket transport was replaced with a deterministic event injector so the exact server ordering from the reported race could be reproduced without needing an LLM-backed ProgramBench run.
Test 1: FINISHED stop-hook race
Step 1 — Reproduce baseline without the fix:
Ran git checkout -q origin/main; OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py finished_race || true:
Run completed via WebSocket notification (status: finished, elapsed: 0.1s)
{"elapsed_ticks_seconds": 0.15, "error": "", "outcome": "returned", "poll_count": 1, "posts_to_run": 1, "scenario": "finished_race", "seen_statuses": ["running"]}
This confirms the bug exists: after the WS FINISHED, the client returned even though the next REST state had already reverted to running, matching a stop hook veto window.
Step 2 — Apply the PR's changes:
Checked out fix/remote-conversation-finished-hint at 932a823ee27534897f6e56fd25bb95f4a163d0ce.
Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py finished_race:
Run completed via post-run WebSocket state update (status: finished, elapsed: 0.4s)
{"elapsed_ticks_seconds": 0.5, "error": "", "outcome": "returned", "poll_count": 5, "posts_to_run": 1, "scenario": "finished_race", "seen_statuses": ["running", "running", "running", "finished", "finished"]}
This shows the fix works: run() no longer treats per-field WS FINISHED as authoritative, waits through the running stop-hook veto window, and completes on the post-run full-state snapshot.
Test 2: ERROR remains an immediate terminal signal
Step 1 — Establish baseline behavior:
Ran git checkout -q origin/main; OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py error_fast_path || true:
{"elapsed_ticks_seconds": 0.15, "error": "ConversationRunError: Conversation run failed for id=9b21f108-8f7-4aa1-89bf-b6fea6dd241e: Remote conversation ended with error", "outcome": "raised", "poll_count": 0, "posts_to_run": 1, "scenario": "error_fast_path", "seen_statuses": []}
Baseline raised immediately on WS ERROR without REST polling.
Step 2 — Re-run with the PR:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py error_fast_path:
{"elapsed_ticks_seconds": 0.15, "error": "ConversationRunError: Conversation run failed for id=f8bf31b9-f806-483c-b0b8-d1eb4ebe4c40: Remote conversation ended with error", "outcome": "raised", "poll_count": 0, "posts_to_run": 1, "scenario": "error_fast_path", "seen_statuses": []}
This confirms the PR preserves the stated immediate terminal handling for ERROR.
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
…rtion (round 6)
- Replace 'assert signal.from_post_run_snapshot' with explicit
'if not ... raise AssertionError(...)' (PRRT_kwDOPjFrIs6FUiPj) so the
invariant guard survives python -O and optimised builds.
- Strengthen test_remote_conversation_run_full_state_updates_cached_state
assertions (PRRT_kwDOPjFrIs6FUiPz): add exact cost check
(accumulated_cost == pytest.approx(1.25)) and key presence check
('test-llm' in usage_to_metrics) so the state-reconciliation path is
properly guarded against regressions.
Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Review: Fix RemoteConversation FINISHED stop-hook race
Summary
This is a well-designed and carefully-implemented fix for a subtle but genuine race condition. The root cause is clear: the server-side LocalConversation.run loop releases its state lock at the end of each iteration, so a FINISHED status set by agent.step() is observable to WebSocket clients before stop hooks run in the next iteration. Stop hooks can flip FINISHED back to RUNNING, making the WS-delivered FINISHED advisory rather than authoritative.
Correctness
The fix is correct. The key insight — treating WebSocket FINISHED as a hint and waiting for the server's post-run full-state snapshot as the authoritative termination signal — properly closes the race without degrading the ERROR/STUCK fast path. The _IMMEDIATE_TERMINAL_STATUSES set and _RunCompletionSignal.from_post_run_snapshot flag make the two-track behavior explicit and type-safe.
The _run_armed guard is elegant: clearing before the POST and setting after it ensures the initial WS subscription snapshot (delivered during connect()) can never be mistaken for the post-run completion signal. The finally: self._run_armed.clear() cleanup means the guard is always restored regardless of how _wait_for_run_completion exits.
REST Hard Fallback
The 30-second time-based hard fallback (TERMINAL_HARD_FALLBACK_SECS) is a sensible safety valve for the case where the WebSocket drops after the server finishes but before the post-run full-state snapshot is delivered. Hardcoding 30 s is reasonable given the stated rationale (stop hooks are fast; 30 s exceeds any reasonable hook execution time).
One nuance: the terminal_first_seen_at timer resets on every REST poll exception (see inline comment). In a degenerate scenario where the server alternates between reporting terminal status and throwing HTTP errors, the 30 s threshold could keep sliding indefinitely with timeout=None. Noted inline for awareness — not a blocker.
Test Coverage
The regression suite is thorough:
test_remote_conversation_run_waits_for_post_run_full_state— verifies the client does not return on the first REST-visible FINISHED without a WS snapshot.test_remote_conversation_run_stale_pre_run_snapshot_is_ignored— verifies_run_armedrejects pre-run WS snapshots.test_remote_conversation_run_ws_error_still_terminates_immediately— confirms ERROR short-circuits without waiting for REST.test_remote_conversation_run_rest_hard_fallback_when_ws_silent— exercises the 30 s fallback viatime.monotonicpatching.test_remote_conversation_run_keeps_error_queued_after_running_poll— confirms ERROR is not discarded by a racing non-terminal REST poll.
The install_post_run_full_state helper is well-documented (the comment explaining why it fires from GET rather than POST — to avoid racing _run_armed.set() — is particularly valuable).
Non-Core Changes
The programbench additions across skill files, the workflow, and manage_evals.py are clean mechanical updates. The AGENTS.md <KNOWN_RACES_AND_GOTCHAS> section is high-quality institutional knowledge that will save the next developer who encounters anything adjacent to this race.
Verdict
✅ The fix is correct, the tests are solid, and the documentation is exemplary. No blocking concerns.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
RemoteConversation now waits through a simulated FINISHED→RUNNING stop-hook race and completes only after the authoritative post-run full-state snapshot.
Does this PR achieve its stated goal?
Yes. I reproduced the pre-fix behavior on origin/main: RemoteConversation.run(blocking=True) returned from a WebSocket execution_status=finished hint even though the simulator's next authoritative REST status was running, matching the stop-hook race described in the PR. On this PR, the same user-facing RemoteConversation run ignored that FINISHED hint, observed three running polls, then returned only after the post-run full-state snapshot; I also verified WebSocket ERROR still raises immediately without REST polling.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv run created project environments and imported the SDK successfully. |
| CI Status | ✅ 31 checks successful; only this qa-changes check was still in progress; skipped cleanup/artifact checks noted. |
| Functional Verification | ✅ Before/after RemoteConversation execution confirms the race is fixed and ERROR remains immediate. |
Functional Verification
Test 1: FINISHED WebSocket hint during stop-hook revert
Step 1 — Reproduce baseline without the fix:
Ran from an origin/main worktree:
cd /tmp/sdk-qa-main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py --scenario raceKey output:
Run completed via WebSocket notification (status: finished, elapsed: 0.0s)
SCENARIO=race
OUTCOME=returned
ELAPSED_SECS=0.01
REST_STATUS_POLLS=1
RUNNING_REVERT_SEEN=True
POST_RUN_FULL_STATE_SENT=False
HTTP_SEQUENCE=...,POST /api/conversations/.../run,GET /api/conversations/...
This confirms the bug exists on main: the client returned from the WS finished hint while the simulator's REST state had already reverted to running, and before any post-run full-state snapshot was sent.
Step 2 — Apply the PR's changes:
Used the checked-out PR branch fix/remote-conversation-finished-hint at commit 7c51f00439bfea78797ee0025b55cb9653646fd4.
Step 3 — Re-run with the fix in place:
OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py --scenario raceKey output:
Run completed via post-run WebSocket state update (status: finished, elapsed: 0.2s)
SCENARIO=race
OUTCOME=returned
ELAPSED_SECS=0.22
REST_STATUS_POLLS=4
RUNNING_REVERT_SEEN=True
POST_RUN_FULL_STATE_SENT=True
HTTP_SEQUENCE=...,POST /api/conversations/.../run,GET /api/conversations/...,GET /api/conversations/...,GET /api/conversations/...,GET /api/conversations/...,GET /api/conversations/.../events/search
This shows the fix works: run() did not stop at the per-field FINISHED hint, survived the simulated stop-hook running reverts, and returned only after the authoritative post-run full-state snapshot.
Test 2: WebSocket ERROR remains immediate
Ran on the PR branch:
OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py --scenario ws-errorKey output:
SCENARIO=ws-error
OUTCOME=raised
ERROR=ConversationRunError: Conversation run failed for id=...: Remote conversation ended with error
ELAPSED_SECS=0.02
REST_STATUS_POLLS=0
POST_RUN_FULL_STATE_SENT=False
This confirms the PR preserves the stated fast path for ERROR: it raises immediately from the WebSocket update without waiting for REST confirmation.
Issues Found
None.
This QA review was created by an AI agent (OpenHands) on behalf of the user.
…rationale (round 7) Add comment at the exception handler explaining why terminal_first_seen_at is reset on any REST poll error (PRRT_kwDOPjFrIs6FUzqe): we cannot confirm the server is still in a terminal state after a failed poll, so we conservatively restart the hard-fallback timer to prefer a false-negative wait over a false-positive early return. Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste
This is a well-executed fix for a real and subtle race condition. The PR has been through multiple review iterations and the current HEAD (b5bec8c) is in clean shape.
Root cause & fix correctness
The race is real: the server-side LocalConversation.run loop releases its state lock at the end of each iteration, so a FINISHED status set by agent.step() is observable by WS clients before the next iteration's stop hooks run. A stop hook returning rc=2 flips the status back to RUNNING — but the old client already exited. The fix correctly identifies the authoritative signal: not the per-field WS FINISHED event, but the server's post-run full-state ConversationStateUpdateEvent published after conversation.run()/arun() exits and pending events flush.
Three-part termination design
_IMMEDIATE_TERMINAL_STATUSES(ERROR, STUCK) — per-field WS updates are still immediately terminal for these; stop hooks cannot revert them.- Per-field WS FINISHED — excluded from
_IMMEDIATE_TERMINAL_STATUSES; treated as advisory only and never enqueued for the fast-path. - Post-run full-state WS snapshot — authoritative; gated by
_run_armed(set after POST /run, cleared infinally) to exclude the initial subscription snapshot that arrives duringconnect().
This is exactly the right separation of concerns.
Invariant guard at line 1153
The final iteration's fix — replacing assert signal.from_post_run_snapshot with an explicit raise AssertionError(...) — is correct. The assert form is silenced by python -O and some packaging tools; raise AssertionError(...) is not. The guard is currently theoretically unreachable (ERROR/STUCK raise in _handle_conversation_status before reaching this line, so only from_post_run_snapshot=True signals arrive), but documents the invariant and will catch regressions if _IMMEDIATE_TERMINAL_STATUSES is extended with a status that doesn't raise.
Hard fallback
TERMINAL_HARD_FALLBACK_SECS = 30.0 is the right safety valve for WS delivery failures (dropped socket after run finishes on the server). The timer resets on poll errors, which correctly prefers false-negative waits over false-positive early returns. 30 s is a generous but reasonable bound — stop hooks are fast; a hook revert would flip REST back to RUNNING within milliseconds, well before the threshold.
Tests
The new test cases exercise real code paths and assert on outputs:
test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal— the core regression: WS-delivered FINISHED after/runtrigger does not completerun().test_remote_conversation_run_rest_finished_revert_waits_for_full_state— REST FINISHED→RUNNING revert does not prematurely return.test_remote_conversation_run_stale_pre_run_snapshot_is_ignored—_run_armedguard works end-to-end.test_remote_conversation_run_ws_error_still_terminates_immediately— ERROR via WS still fast-paths (poll_interval=10s would make a fall-through take 30+ s).test_remote_conversation_run_rest_hard_fallback_when_ws_silent— hard fallback activates when WS snapshot is never delivered.install_post_run_full_statehelper fires the WS event from the GET side effect (after_run_armed.set()) rather than from the POST side effect, correctly avoiding the arming race.
Evidence
ProgramBench CI links are present and passing (parent eval job, downstream inference phase, downstream eval harness phase).
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM — Remote run-completion semantics are core infrastructure. The change is narrowly targeted, all previously raised concerns are resolved in the current HEAD, and regression test coverage is solid. The 30 s hard fallback prevents indefinite hangs if WS delivery fails. Recommend merging after final maintainer sign-off.
VERDICT:
✅ Worth merging: The fix is correct, the invariant guard is robust, tests are real, and the evidence is present. All concerns from previous review rounds are resolved.
KEY INSIGHT:
Per-field WS FINISHED is advisory; only the server's post-run full-state snapshot is authoritative, because stop hooks can revert FINISHED before the server's run loop exits.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.- Re-request a review — the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
Was this review helpful? React with 👍 or 👎 to give feedback.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
The RemoteConversation race fix was functionally verified with a real SDK client talking to a local HTTP/WebSocket server that reproduced the FINISHED stop-hook race.
Does this PR achieve its stated goal?
Yes. On origin/main, RemoteConversation.run(blocking=True) returned immediately after a per-field WebSocket FINISHED hint while REST still reported running, reproducing the premature-return bug. On PR commit b5bec8c7, the same scenario waited through the reverted running statuses and returned only after the post-run full-state WebSocket snapshot reported finished; a separate run also confirmed WebSocket ERROR still raises immediately.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv run created/synced the project environment and imported the SDK successfully. |
| CI Status | 🟡 At check time: 31 successful, 5 skipped, and qa-changes in progress. |
| Functional Verification | ✅ Reproduced the old race, verified the fixed behavior, and checked the ERROR fast path. |
Functional Verification
Test 1: FINISHED WebSocket hint vs stop-hook revert race
Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_race.py:
{
"log": "Run completed via WebSocket notification (status: finished, elapsed: 0.0s)",
"run_returned": true,
"elapsed_seconds": 0.003,
"run_get_count": 1,
"statuses_returned": ["running"],
"final_cached_status": "running"
}This confirms the bug exists: the old client returned from run() on the per-field WebSocket finished hint even though the server's authoritative REST state had already reverted to running.
Step 2 — Apply the PR's changes:
Checked out PR commit b5bec8c723e0415a3b738ef91faea7f0e2842458.
Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_race.py:
{
"log": "Run completed via post-run WebSocket state update (status: finished, elapsed: 0.2s)",
"run_returned": true,
"elapsed_seconds": 0.211,
"run_get_count": 4,
"statuses_returned": ["running", "running", "running", "finished"],
"final_cached_status": "finished"
}This confirms the fix works: the PR client ignored the unsafe per-field FINISHED hint, observed the server remain running for three polls, and returned only after the post-run full-state snapshot.
Test 2: WebSocket ERROR remains immediate
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_error.py on the PR commit:
{
"raised": "ConversationRunError",
"elapsed_seconds": 0.004,
"run_get_count": 0,
"message_contains_error": true
}This confirms the PR preserves the intended immediate ERROR behavior: run() raised before any REST polling despite a 10-second poll interval.
Issues Found
None.
This QA review was generated by an AI agent (OpenHands) on behalf of the user.
Co-authored-by: openhands <openhands@all-hands.dev>
|
Pushed a minimal simplification in 6808b8a: RemoteConversation now treats only the post-run full-state WebSocket snapshot as authoritative completion and lets REST polling handle error/stuck fallback. Ready for another automated look.\n\nThis comment was generated by an AI agent (OpenHands) on behalf of neubig. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
The FINISHED stop-hook race is fixed in the exercised RemoteConversation path, but QA found a related regression in immediate WebSocket ERROR handling.
Does this PR achieve its stated goal?
Yes for the primary goal. Using a short SDK harness that calls RemoteConversation.run() against a fake remote server, the base branch returned after a per-field WebSocket FINISHED while the server was still reporting RUNNING; the PR ignored that hint and waited until the post-run full-state snapshot arrived. However, the same exercise shows per-field WebSocket ERROR is no longer surfaced immediately, despite the PR description saying ERROR/STUCK remain immediate terminal states.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed and installed the uv development environment |
| CI Status | 🟡 Most checks passing; qa-changes and several image build jobs were still pending when checked |
| Functional Verification |
Functional Verification
Test 1: FINISHED stop-hook race behavior
Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --quiet origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py finished-race:
5e614afe
Run completed via WebSocket notification (status: finished, elapsed: 0.0s)
finished_race_rest_polls=1
This confirms the bug exists: the client accepted the per-field WebSocket FINISHED and returned while the fake server was still in the first REST RUNNING poll that represents the stop-hook veto window.
Step 2 — Apply the PR's changes:
Checked out PR commit 6808b8a8a4958ec8d762243f5152d125265500b9.
Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py finished-race:
6808b8a8
REST has reported terminal status finished for 3 polls without a post-run WebSocket snapshot; continuing to wait for the authoritative snapshot (elapsed: 0.1s)
Run completed via post-run WebSocket state update (status: finished, elapsed: 0.1s)
finished_race_rest_polls=6
This shows the fix works for the PR's main goal: run() ignored the early per-field FINISHED, continued through the simulated RUNNING revert, and returned only after the authoritative full-state snapshot.
Test 2: ERROR remains immediate terminal behavior
Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --quiet origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py per-field-error:
per_field_error_result=ConversationRunError after 0.00s; REST polls=0; message=Conversation run failed for id=2d30afb2-ff95-46e8-a902-a14442d222d9: Remote conversation ended with error
This shows the prior client behavior: a per-field WebSocket ERROR was surfaced immediately without waiting for REST polling.
Step 2 — Apply the PR's changes:
Checked out PR commit 6808b8a8a4958ec8d762243f5152d125265500b9.
Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py per-field-error:
per_field_error_result=ConversationRunError after 0.26s; REST polls=5; message=Conversation run failed for id=b835439a-8ef7-4acf-954e-c109e1980e76: Run timed out after 0.25 seconds. The conversation may still be running on the server.
This shows a regression in the stated non-FINISHED behavior: the WebSocket ERROR was ignored and the caller only got a timeout while REST kept reporting RUNNING.
Issues Found
- 🟠 Issue: Per-field WebSocket
ERROR/STUCKupdates no longer appear to be immediate terminal signals; a user can now wait until polling/timeout if REST remains stale.
This review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste
This is a well-executed fix for a real, production-observed race condition. The root cause analysis is accurate: the server-side LocalConversation.run loop releases its state lock at iteration end, so a per-field FINISHED WS event can arrive before stop hooks run in the next iteration. The architectural fix — treating per-field status updates as advisory and requiring the post-run full-state WS snapshot as the authoritative completion signal — is the correct approach.
The _run_armed (threading.Event) mechanism elegantly prevents the initial WS subscription snapshot from being mistaken for a post-run completion signal by gating the callback before the POST /run call. The 30-second time-based TERMINAL_HARD_FALLBACK_SECS is a well-reasoned safety valve: stop hooks complete in seconds, so 30 s provides a strong signal that the WS snapshot was lost without introducing a painful hang. Tests are comprehensive, exercise real code paths, and include a patch of time.monotonic to deterministically trigger the hard-fallback path — well above the bar for regression coverage.
[IMPROVEMENT OPPORTUNITIES]
AGENTS.md note references non-existent symbols
The newly added <KNOWN_RACES_AND_GOTCHAS> note says "the REST poll path with TERMINAL_POLL_THRESHOLD consecutive terminal polls" and "See _immediate_terminal_statuses". Neither symbol exists in the current implementation:
- The count-based
TERMINAL_POLL_THRESHOLDwas replaced by the time-basedTERMINAL_HARD_FALLBACK_SECS = 30.0. _immediate_terminal_statusesis not a method onRemoteConversation; ERROR/STUCK are handled inline inside_handle_conversation_status.
A future contributor reading AGENTS.md and then grepping the codebase for these symbols will find nothing. Suggest updating the sentence to reference TERMINAL_HARD_FALLBACK_SECS and _handle_conversation_status.
_handle_conversation_status return value is never consumed
The method signature is -> bool and its docstring says "return True if the run is complete." However, both call sites (WS path and REST path in _wait_for_run_completion) discard the return value. Callers rely entirely on the exception-raise side effect for ERROR/STUCK. Either drop the return type to None, or add a callsite that consumes it. As written it creates a false impression that the bool controls downstream logic.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW — the change tightens run-completion semantics and is covered by 32 passing tests plus end-to-end ProgramBench CI evidence cited in the PR description. The eval-infrastructure changes (addingprogrambenchto benchmark lists) are purely additive. The only rollback risk is if a deployment environment consistently drops the post-run WS full-state snapshot, in which case the 30 s hard fallback adds latency; that scenario is preferable to the previous premature-termination bug.
VERDICT:
✅ Worth merging: Core logic is sound, tests are solid, real production evidence provided. The AGENTS.md stale references and unused return type are polish items, not blockers.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better by adding a
.agents/skills/custom-codereview-guide.mdfile to your branch with the/codereviewtrigger and context the reviewer is missing.Was this review helpful? React with 👍 or 👎 to give feedback.
Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified the RemoteConversation stop-hook race behavior with a functional SDK harness: the PR no longer returns on an early WebSocket FINISHED hint and still raises immediately on WebSocket ERROR.
Does this PR achieve its stated goal?
Yes. The stated goal is to prevent RemoteConversation.run(blocking=True) from treating a per-field WebSocket FINISHED update as authoritative when a server-side stop hook may revert the run to RUNNING. In the baseline run, the client returned immediately after the early WS FINISHED and before the simulated post-run full-state snapshot (delivered_full_state: false). With the PR applied, the same user-facing RemoteConversation.run(...) call stayed active through the simulated stop-hook veto window and returned only after the authoritative full-state snapshot (poll_count: 5, delivered_full_state: true).
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully with uv sync --dev |
| CI Status | ⏳ 19 successful, 8 pending, 3 skipped, 0 failing at last check |
| Functional Verification | ✅ RemoteConversation race fix and immediate WS error path exercised |
Functional Verification
Test 1: RemoteConversation.run() must ignore early WS FINISHED hints until post-run full state
Step 1 — Reproduce / establish baseline without the fix:
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py.
The harness uses the public SDK RemoteConversation.run(blocking=True) against a local HTTP remote-conversation harness that behaves like the problematic server timeline: POST /run emits a per-field WS execution_status=finished, then REST reports running for the stop-hook veto window, then a post-run full-state snapshot is delivered later.
Observed baseline excerpt:
{
"outcome": "returned",
"error": null,
"poll_count": 1,
"delivered_full_state": false,
"elapsed_seconds": 0.003
}Log excerpt:
Run completed via WebSocket notification (status: finished, elapsed: 0.0s)
This confirms the bug: the client returned on the early WS FINISHED before the simulated server completed stop-hook evaluation and before the authoritative full-state snapshot arrived.
Step 2 — Apply the PR's changes:
Checked out fix/remote-conversation-finished-hint at 384877e691edde70dcb8b2b40afa08e0cdedc8be.
Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py.
Observed PR excerpt:
{
"outcome": "returned",
"error": null,
"poll_count": 5,
"delivered_full_state": true,
"elapsed_seconds": 0.26
}Log excerpt:
Run completed via post-run WebSocket state update (status: finished, elapsed: 0.3s)
This confirms the fix works: the client ignored the early per-field FINISHED, continued polling through the simulated RUNNING stop-hook veto window, and returned only after the full-state WebSocket update.
Test 2: WebSocket ERROR remains immediate
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_ws_error.py on the PR branch. The harness emits a WS execution_status=error immediately after POST /run while REST would otherwise report running if polled.
Observed excerpt:
{
"outcome": "raised",
"error": "ConversationRunError: Conversation run failed for id=...: qa injected error",
"poll_count": 0,
"elapsed_seconds": 0.004
}This confirms the PR preserves the intended immediate terminal handling for error states; it raised without waiting for REST polling.
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
There was a problem hiding this comment.
Solid fix for a subtle but real race. The core design — treating per-field WS FINISHED as advisory and only accepting full-state snapshots as authoritative completion signals — is the right call, and the _run_armed gate elegantly prevents the initial subscription snapshot from masquerading as a post-run signal. Test coverage is thorough, including the regression scenario, ERROR short-circuit, hard-fallback path, and stale-snapshot guard.
Two minor observations below; neither is a blocker.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Graham Neubig <398875+neubig@users.noreply.github.com> Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Graham Neubig <398875+neubig@users.noreply.github.com> Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Summary
Fixes a RemoteConversation race where a WebSocket
FINISHEDevent could cause the remote client to stop waiting before server-side stop hooks finished evaluating whether the conversation should actually terminate.Before this change,
RemoteConversationtreatedFINISHEDfrom the WebSocket stream as authoritative. That is unsafe when stop hooks can veto completion and move the conversation back toRUNNING. The client could exit early, collect outputs, and proceed while the server-side conversation was still active.After this change:
ERRORandSTUCKremain immediate terminal states.FINISHEDis treated as a wake-up hint.Motivation
This was discovered during ProgramBench end-to-end CI. ProgramBench relies on stop hooks for compile-contract and reference-output validation. When a stop hook rejects completion, the agent must keep working; early client teardown can produce premature submissions.
Successful ProgramBench CI evidence that exercised this path:
Changes
openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.pyFINISHEDas advisory and confirms final state through REST polling.tests/sdk/conversation/remote/test_remote_conversation.pyFINISHED-vs-stop-hook race.AGENTS.mdVerification
uv run pytest tests/sdk/conversation/remote/test_remote_conversation.py -q32 passedStacking
This PR is split out from #3075 so the critical conversation-termination behavior change can receive focused human review. The ProgramBench workflow/docs PR should be stacked on top of this branch and contain only the benchmark-choice/doc changes.
This PR was created by an AI agent (OpenHands) on behalf of the user.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:384877e-pythonRun
All tags pushed for this build
About Multi-Architecture Support
384877e-python) is a multi-arch manifest supporting both amd64 and arm64384877e-python-amd64) are also available if neededIssue
Resolves #3294